Personal Challenge

Challenge description:

I decided to work on the topic of agriculture in India because I am interested in seeing whether my model's final solution can later be enhanced to give small farmers in underdeveloped countries like mine a chance to have a better agricultural system.
At this stage, from this project I would like to gain many of the skills involved in creating an AI through machine learning (supervised learning), an exploratory data analysis approach, and ethics (societal impact).

Challenge predictive goal:

My predictive goal is to predict how much a farmer will produce based on some inputs, thereby giving them a good idea of the best periods for a good yield of crops/produce.

Student Information:

Name: Osuntuyi Michael

Data set information

Data collection: The data was gathered from
https://data.gov.in/

Dataset description:
This dataset contains information about states and districts in India: how much of each crop they produced, the name of the crop, the year in which the recorded production (tonne) was achieved, and how much area (hectare) of land was required to achieve the amount produced.

Here I give new names to the Area and Production columns because I would like to know
the unit of measurement for these columns at all times

Importing the dataset obtained from the Indian government's official open data website

After importing the dataset, I noticed that row index 0 contains the column names as values,
so below I remove this row

After dropping the row, I noticed the index values were no longer in order, so I reset the index to arrange them properly
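
A minimal sketch of the import, row removal and index reset described above. It assumes the new column names were passed through the `names=` argument of `pd.read_csv`, which would explain why the header line shows up as row 0. The sample rows and every column name other than the Area and Production units are hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the downloaded CSV file.
raw_csv = io.StringIO(
    "State_Name,District_Name,Crop_Year,Season,Crop,Area,Production\n"
    "Kerala,Kollam,2010,Kharif,Rice,100,250\n"
    "Kerala,Kollam,2011,Rabi,Rice,80,190\n"
)

# Assumed column names; the unit suffixes follow the renaming described above.
columns = [
    "State_Name", "District_Name", "Crop_Year", "Season",
    "Crop", "Area(ha)", "Production(tonne)",
]

# With names= and no header=0, pandas treats the header line as data,
# so the original column names end up as values in row 0.
df = pd.read_csv(raw_csv, names=columns)

# Drop that row, then rebuild a clean 0..n-1 index.
df = df.drop(index=0).reset_index(drop=True)
print(df)
```

Passing `header=0` together with `names=` would be another option, since pandas would then discard the original header line while reading.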

Trying to group the data based on seasons in the dataset

Finding(s)

When I try to group the data by season, the grouped information is not shown

Finding the datatype for each column in this dataset

Finding(s)
The columns Area and Production are of type object, but I would like these columns to hold numerical values
so that mathematical operations can be performed on them

Therefore, I am going to convert these columns to the float datatype.
I chose float in case any values are decimals/fractions
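
The conversion could look like the sketch below. The sample values are made up, and `errors="coerce"` is an assumption on my side: it turns entries that cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical sample; the values are strings, as they would be after the
# header row was read in as data.
df = pd.DataFrame({
    "Area(ha)": ["100", "80.5", "12"],
    "Production(tonne)": ["250", "190.25", "n/a"],
})

for column in ["Area(ha)", "Production(tonne)"]:
    # Assumption: unparseable entries should become NaN rather than stop the run.
    df[column] = pd.to_numeric(df[column], errors="coerce").astype(float)

print(df.dtypes)
```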

EDA Next step

Getting statistical data about each column

Getting the number of districts per state

This lets me find out how many district names are associated with each state in this dataset

Renaming the District_name column which contains the count of districts in a state to District_Count

Sorting the districts_count_per_state dataset by District_Count in descending order to see the states with the most districts
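
A small sketch of the count, rename and sort steps above. The column names State_Name and District_Name and the sample rows are assumptions, and counting distinct districts with `nunique` is one possible reading of "number of districts per state".

```python
import pandas as pd

# Hypothetical sample: districts repeat because each row is one crop record.
df = pd.DataFrame({
    "State_Name": ["Kerala", "Kerala", "Kerala", "Assam", "Assam"],
    "District_Name": ["Kollam", "Kollam", "Idukki", "Baksa", "Baksa"],
})

districts_count_per_state = (
    df.groupby("State_Name")["District_Name"]
    .nunique()                                    # distinct districts per state
    .reset_index()
    .rename(columns={"District_Name": "District_Count"})
    .sort_values("District_Count", ascending=False)
    .reset_index(drop=True)
)
print(districts_count_per_state)
```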

Visualization of the top 20 states with the highest number of districts

Hypothesis

The states with more districts should have higher crop production than states with fewer

Finding(s)

I have noticed that this dataset contains records of how much of a crop was produced over a whole year.
So that this information does not affect my analysis, I will remove the whole year data for this analysis but not from my entire dataset

Conclusion:

My hypothesis that the states with a lot of districts should have the highest amount of crops produced is incorrect

According to the graph above, Uttar Pradesh has the highest yield, and the squarify graph plotted above, which shows how many districts each state has, shows that Uttar Pradesh is also the state with the most districts. However, some states do not follow this pattern, e.g. Madhya Pradesh has the second-highest number of districts but is only the 17th-highest producing state

Getting the amount of crops produced per season

I do this because I want to find the seasons with the most and least yield and also see the total production amount for all seasons in the dataset

Finding(s)

After grouping the crops produced per season, I found that one recorded season is called whole year, which consists of the total amount produced for the whole year

Solution

I do not want this value when plotting, so I am going to remove it from the amount_crops_produced_per_season table

Printing out the value whole year from the season column

Before trying to remove Whole Year from the Season column, I decided to print out its value from the amount_crops_produced_per_season table to check whether the value has any whitespace

Finding(s)

After printing out whole year, I found that this value has trailing whitespace.

Solution

I have to remove the whitespace present at the end of the text in the Season column of the main dataframe
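
One way to make the trailing whitespace visible and then remove it, sketched on made-up Season values. `repr()` is used only because it shows the spaces inside the quotes.

```python
import pandas as pd

# Hypothetical Season values padded with trailing spaces.
df = pd.DataFrame({"Season": ["Whole Year ", "Kharif     ", "Rabi       "]})

# repr() keeps the quotes, so trailing spaces are easy to spot.
print([repr(value) for value in df["Season"].unique()])

# Strip leading and trailing whitespace from every value in the column.
df["Season"] = df["Season"].str.strip()
print([repr(value) for value in df["Season"].unique()])
```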

Checking whether the whitespace in the Season column has been removed

Retrying getting amount of crops produced per season.

Removing the whole year season from the amount_crops_produced_per_season table

I am dropping the whole year data because I only want to see how much was produced in each season, not over the whole year
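
A sketch of the grouping and removal, assuming the season label reads "Whole Year" once the whitespace is stripped and that production is summed per season. The sample numbers are invented.

```python
import pandas as pd

# Hypothetical sample with one whole-year record.
df = pd.DataFrame({
    "Season": ["Kharif", "Rabi", "Whole Year", "Kharif"],
    "Production(tonne)": [250.0, 190.0, 5000.0, 50.0],
})

# Total production per season.
amount_crops_produced_per_season = (
    df.groupby("Season", as_index=False)["Production(tonne)"].sum()
)

# Keep every season except the whole-year total.
is_whole_year = amount_crops_produced_per_season["Season"] == "Whole Year"
amount_crops_produced_per_season = (
    amount_crops_produced_per_season[~is_whole_year].reset_index(drop=True)
)
print(amount_crops_produced_per_season)
```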

Graphical representation showing the total amount of crops produced per season

Removal of whole year data from the main dataset

I have decided to remove the data regarding whole year from the main dataset
because I do not need this data for my overall analysis or for my predictive model

Getting total amount of crops produced per year

I do this because I want to find the years with the most and least yield and also see the total production amount for all years in the dataset

Finding(s)

After trying to group the production amount by each year,
I found out the crop year does not get grouped properly

Finding out why the Crop_Year column does not group properly

(1) Hypothesis of problem: Crop_Year column values are not of the same datatype

Conclusion

After checking, I know all the values in the crop_year column are of the datatype object.

(2) Hypothesis of problem: Crop_year column values are not unique

Conclusion

I found out that each value in the column is seen as unique

Solution to why the Crop_Year column does not group properly

By checking whether all the values are unique and looking closely at the output, I found that the values are not all of the same datatype.

This is because some of the values returned by the unique function have quotation marks, which signifies strings, and some do not, which signifies numerical values.

Therefore I have to convert the entire column to strings.
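
The sketch below reproduces the problem on invented data: an object column holding both integers and strings treats 2010 and "2010" as different values until everything is converted to one type.

```python
import pandas as pd

# Hypothetical sample: the same years stored as both int and str.
df = pd.DataFrame({
    "Crop_Year": pd.Series([2010, "2010", 2011, "2011"], dtype=object),
    "Production(tonne)": [100.0, 50.0, 70.0, 30.0],
})

print(df["Crop_Year"].unique())                   # quoted and unquoted values
print(df["Crop_Year"].map(type).value_counts())   # Python types in the column
unique_before = df["Crop_Year"].nunique()

# Convert the whole column to strings so equal years compare equal.
df["Crop_Year"] = df["Crop_Year"].astype(str)
unique_after = df["Crop_Year"].nunique()

production_per_year = df.groupby("Crop_Year")["Production(tonne)"].sum()
print(production_per_year)
```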

Now checking if the values in Crop_Year are unique

Conclusion

After the conversion to string all values are unique.

Continuation:

Getting statistical data about each column

Getting total amount of crops produced per year

Visualization of total amount of crops produced per year

Image of Crops produced per year before whole year data removal

Finding(s)

From the graph, I found that there was a huge increase in the amount of crops produced in the years 2010 and 2014 but a huge decline in 2015, so I will have to find out why this is so.

I plan on adding another dataset containing weather conditions for each season per year; this may help me find out why there was a huge increase in 2011 and 2013 and then a huge decrease in 2015

Finding(s)

After more EDA was done on this dataset, I realised it was the production values from whole year that caused the spike in the years 2011 and 2014.
Once the production values for the whole year season are removed, we have a more gradual peak rather than a huge spike in production
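
To illustrate the effect, the yearly totals can be computed with and without the whole-year rows and placed side by side. This is a sketch on invented numbers, not the real production figures.

```python
import pandas as pd

# Hypothetical sample: one whole-year record inflates the 2011 total.
df = pd.DataFrame({
    "Crop_Year": ["2010", "2010", "2011", "2011"],
    "Season": ["Kharif", "Rabi", "Kharif", "Whole Year"],
    "Production(tonne)": [100.0, 80.0, 110.0, 5000.0],
})

with_whole_year = df.groupby("Crop_Year")["Production(tonne)"].sum()
without_whole_year = (
    df[df["Season"] != "Whole Year"]
    .groupby("Crop_Year")["Production(tonne)"]
    .sum()
)

# Side-by-side yearly totals.
comparison = pd.DataFrame({
    "with_whole_year": with_whole_year,
    "without_whole_year": without_whole_year,
})
print(comparison)
```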

For each year show the total amount of crops that were produced and the name of the crop

Matplotlib plot

Visualization of total amount of crops produced (in thousands) for each year

Hypothesis:

The crop that is farmed the most should have the highest production, e.g. if rice is produced by 5 countries and wheat by just 2, rice should have a larger amount produced

Due to matplotlib not displaying the crop names on the x axis properly,
I decided to switch to using plotly to visualize the data above

Plotly plot

Visualization of total amount(million) of crops produced per crop for each year

Visualization of total amount(thousand) of crops produced per crop for each year

Ordering the crop count dataframe by crop_count column to see the top crops in this dataset

Visualization of entire Crop-Count

Explanation of the newly generated column crop_count:

crop_count contains how many times a certain crop appears in the dataset

For a better view, I have filtered the crop count to include only the 25 most popular crops or the 25 least popular
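
A sketch of how such a crop count and its top and bottom slices could be built. The crop rows are invented, and `top_n` is set to 2 only to keep the example small; the notebook uses 25.

```python
import pandas as pd

# Hypothetical sample: one row per crop record.
df = pd.DataFrame({"Crop": ["Rice"] * 4 + ["Maize"] * 3 + ["Apple"]})

# Number of rows in which each crop appears, most frequent first.
crop_count = (
    df["Crop"].value_counts()
    .rename_axis("Crop")
    .reset_index(name="Crop_Count")
)

top_n = 2  # the notebook uses 25
most_common = crop_count.head(top_n)
least_common = crop_count.tail(top_n)
print(most_common)
print(least_common)
```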

Finding(s) before removal of whole year season

From the visualization in cell 35 (for each year show the total amount of crops that were produced and the name of the crop), I found that over the years Coconut has the highest amount produced.
However, after visualizing the most commonly produced crops,

I noticed that Coconut, the crop with the highest production, is not a commonly produced crop
(it is not among the top 25 crops produced)

Finding(s) after removal of whole year season data

After removing the whole year season data, I found that over the years Coconut is no longer the crop with the highest production, and crops that are among the 25 most popular are the highest produced, e.g. Rice, Maize, Wheat

Getting least 25 Crop_Count

Finding(s):

The top 5 crops in this dataset are Rice, Maize, Moong (Green Gram), Urad, Sesamum
The 5 least common crops in this dataset are Other Dry Fruit, Apple, Peach, Plums, Pear

For the top 25 crops, I want to find out when each crop had the most yield

Finding the best yield season for each of the top 25 crops

Seaborn plot

I decided to use plotly here because of its zoom functionality, so users who have access to my notebook can easily zoom in on data that is too small in the graphical representation

Plotly plot

Finding(s)

From the graph plotted, the Kharif season has the highest production amount per season, so I decided to go further and research why.

Result: While trying to find out why the Kharif season has the highest production, I discovered that Kharif and Rabi, the seasons in this dataset that were unfamiliar to me, are not actually weather seasons but crop seasons

i.e.
Kharif is a crop season for crops which grow during Monsoon/Autumn (July to October)
Rabi is a crop season for crops which grow during Winter (October to March)
Zaid, the third crop season in India, is a crop season for crops which grow in Summer (March to June)

Note:
Because the Monsoon in India falls in Autumn, I have classified both as the same

Data transformation of seasons in the dataset

Creation of new columns called Crop type and Weather seasons

Mapping each data value in the season column to its corresponding weather season
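
A minimal sketch of such a mapping. The exact Season labels in the dataset are an assumption here; the Kharif and Rabi targets follow the crop season description above, with Monsoon treated as Autumn.

```python
import pandas as pd

# Hypothetical Season values; the real labels may differ.
df = pd.DataFrame({"Season": ["Kharif", "Rabi", "Summer", "Winter", "Autumn"]})

# Assumed mapping from the recorded season to a weather season.
season_to_weather = {
    "Kharif": "Autumn",   # Monsoon/Autumn crops
    "Rabi": "Winter",     # Winter crops
    "Summer": "Summer",
    "Winter": "Winter",
    "Autumn": "Autumn",
}

df["Weather_Seasons"] = df["Season"].map(season_to_weather)

# Any label missing from the mapping becomes NaN, so it is worth checking.
unmapped = df[df["Weather_Seasons"].isna()]
print(df)
```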

Here I want to map each crop to its corresponding crop type based on Indian agriculture (Kharif, Zaid, Rabi)

Step 1:
Counting how many times a crop is planted during each weather season. I do this because I want to find the most popular season in which each crop is planted.
Knowing this, I will be able to tell its crop type

Renaming the Crop column after count to Crop_Count

Resetting the index of the newly created dataframe to turn Crop and Weather_Seasons from index levels back into columns

Getting the highest crop count for each crop

Converting result to a dataframe

Resetting the index of the newly created dataframe to turn Crop from the index back into a column

Merging both dataframes to keep only the data entries that exist in both

Mapping each weather season to the crop type it is associated with

Dropping Weather_Seasons from the merged dataframe because I do not need this column anymore

Merging my main dataframe df with my crop dataframe, which contains the crop type.
I use a left merge because it keeps every row of the main dataframe and adds the crop type wherever a matching crop exists
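
The steps above can be sketched as follows. The sample rows, the column names and the weather season to crop type mapping (Autumn to Kharif, Winter to Rabi, Summer to Zaid) are assumptions based on the crop season description earlier. Dropping duplicate crops on ties is my own addition, not necessarily what the notebook does.

```python
import pandas as pd

# Hypothetical sample of crop records with their weather season.
df = pd.DataFrame({
    "Crop": ["Rice", "Rice", "Rice", "Wheat", "Wheat", "Moong"],
    "Weather_Seasons": ["Autumn", "Autumn", "Winter", "Winter", "Winter", "Summer"],
    "Production(tonne)": [250.0, 300.0, 90.0, 400.0, 350.0, 20.0],
})

# Step 1: how often each crop appears in each weather season.
crop_counts = (
    df.groupby(["Crop", "Weather_Seasons"]).size().reset_index(name="Crop_Count")
)

# Highest count per crop, as a dataframe with Crop as a column.
highest = crop_counts.groupby("Crop")["Crop_Count"].max().reset_index()

# Inner merge keeps only the (crop, count) pairs present in both frames,
# i.e. the most frequent season of each crop.
crop_types = crop_counts.merge(highest, on=["Crop", "Crop_Count"], how="inner")

# Assumed mapping from weather season to Indian crop season.
weather_to_crop_type = {"Autumn": "Kharif", "Winter": "Rabi", "Summer": "Zaid"}
crop_types["Crop_Type"] = crop_types["Weather_Seasons"].map(weather_to_crop_type)
crop_types = crop_types.drop(columns=["Weather_Seasons", "Crop_Count"])

# Assumption: on a tie between seasons, keep the first one found.
crop_types = crop_types.drop_duplicates(subset="Crop")

# Left merge: every row of df is kept and gains a Crop_Type column.
df = df.merge(crop_types, on="Crop", how="left")
print(df)
```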

After creating all the required columns from Season, I have decided to delete this column because it is of no further use

Replotting the best yield season for each of the top 25 crops

This graph shows the best yield season for each of the top 25 crops

Finding(s)

From the graph above we can see a more logical representation, which shows that the most popular crops have a higher yield than the least popular ones.
E.g.
Rice is one of the most widely farmed crops, and when we look at the graph we can see that rice has the highest production; the same pattern holds for the other crops analysed

To give a better graphical representation, the graphs below are side-by-side comparisons

For each state what is the total amount produced for crops produced by that state

I would like to see, for each state, the crop with the largest overall yield, which tells me which crops the states in this dataset produce most

Getting total amount of crops produced per each state

Getting each individual state in the dataset

Visualization showing for each state the total amount produced for crops they produce

Finding(s):

After plotting the graph above, I found that most states have a high yield of Rice, Wheat, Maize and Sugarcane

To gather more insight, I created a visualization showing for each state the total amount produced for the crops they produce (below 1 million)

Finding a relationship between Area and Production

Hypothesis:

I decided to see whether there is any relationship between the area used and the amount produced, because logically the larger the area, the more you usually produce

Getting the top 30 data entries

Renaming production column in the new dataframe produced to Total_Production(tonne)

Visualization of how much was produced depending on the area of land used for each crop_type

Initial graphical representation

After creating a graph showing the total amount produced for different crop types depending on the area used,
I decided to gather more insight from the data by finding for each state how much was produced for different crop types depending on how much land was used by that state

Visualization of how much was produced per state depending on the area of land used for each crop_type

Findings:
From the graph, I found that when the area of land increases for a crop type, the amount produced for that crop type also increases

To gather more insight, I decided to plot the graph again with the amount produced limited to less than 1000

EDA next step

Finding out numerical analysis

The next step in my EDA is to check if any columns have missing values(i.e null value).

Finding(s)

After checking for null values, I found that only the column Production has values which are null.

Dealing with null values

Because only one column has null values, I would like to see the other values of the rows where Production is null.
This lets me know whether the rows with null values are important or not.

After displaying the rows with null values in the Production column, I found that all the other data entries in those rows look important,
so I will have to find out why those values are empty and whether I can do anything with the rows other than removing them
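
A short sketch of the null check and of how the affected rows can be displayed. The sample rows and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing production value.
df = pd.DataFrame({
    "State_Name": ["Kerala", "Assam", "Bihar"],
    "Crop": ["Rice", "Maize", "Wheat"],
    "Area(ha)": [100.0, 80.0, 60.0],
    "Production(tonne)": [250.0, np.nan, 120.0],
})

# Number of null values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Full rows where production is missing, to judge whether they matter.
rows_missing_production = df[df["Production(tonne)"].isnull()]
print(rows_missing_production)
```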

Finding correlations with my features using heatmap

Finding(s):

Because some of my features, like Weather_Seasons and Crop_Type, are categorical, my heatmap does not show many features.
To expand my heatmap, I will apply one-hot encoding to Weather_Seasons and Crop_Type to see whether those features have any relationship with my
outcome, Production

Applying pandas dummies to implement one-hot encoding on the Weather_Seasons column.
I do this because I want to check whether the values in Weather_Seasons correlate with values from the other numerical columns

Merging the One hot encoded values to the main dataframe
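
One-hot encoding with `pd.get_dummies` and merging the result back could look like this sketch. The data is invented, and the `prefix` and `dtype=int` arguments are my own choices for readable column names and 0/1 values.

```python
import pandas as pd

# Hypothetical sample with one categorical and two numerical columns.
df = pd.DataFrame({
    "Weather_Seasons": ["Autumn", "Winter", "Summer", "Autumn"],
    "Area(ha)": [100.0, 80.0, 60.0, 40.0],
    "Production(tonne)": [250.0, 190.0, 120.0, 90.0],
})

# One 0/1 column per weather season.
season_dummies = pd.get_dummies(
    df["Weather_Seasons"], prefix="Weather_Seasons", dtype=int
)

# Attach the encoded columns to the main dataframe (rows align on the index).
encoded_df = pd.concat([df, season_dummies], axis=1)
print(encoded_df)
```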

Making a heatmap with the one hot encoded dataframe features

Finding(s)

After creating a heatmap to see whether any features correlate with my target feature Production(tonne),
I noticed that Area(ha) has a good correlation with Production(tonne)
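
The numbers behind such a heatmap come from a correlation matrix. The sketch below computes one with pandas on invented values, so the resulting coefficient says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical sample in which production grows with area.
df = pd.DataFrame({
    "Area(ha)": [10.0, 20.0, 30.0, 40.0],
    "Production(tonne)": [25.0, 45.0, 80.0, 95.0],
    "Weather_Seasons_Autumn": [1, 0, 0, 1],
})

# Pearson correlation between every pair of numerical columns.
correlations = df.corr(numeric_only=True)
area_vs_production = correlations.loc["Area(ha)", "Production(tonne)"]
print(correlations)
print(area_vs_production)
```

A matrix like `correlations` is what a plotting function such as seaborn's `heatmap` would take as input.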

Exporting the current dataset for further use